EDA: Vancouver Street Trees

Introduction

I will be performing an exploratory data analysis (EDA) of the Vancouver Street Trees dataset located here. This dataset lists public trees on boulevards, detailing their coordinates, species, and other characteristics. It excludes park and private trees. The dataset is updated daily on weekdays, but changes are made according to priorities and resources, so certain updates can take years to appear. I'm exploring this dataset because I want to gain insight into the current general condition and state of Vancouver's street trees.

Questions to explore

  1. Which Vancouver neighbourhood has the most trees?
  2. What are Vancouver's top 10 most common tree species?
  3. Which neighbourhood has the tallest trees?
  4. Which neighbourhood has the most cherry blossom trees?
  5. Is there a relationship between tree height and diameter?
  6. What is the geographic distribution of trees in density?
  7. What is the range and distribution of all numerical columns?
  8. What is the range and distribution of all categorical columns?
  9. What is the relationship between numerical and categorical columns?

Description & Review of Data

Columns of Interest:

Data Quality Review

I'll be importing the raw data and extracting some additional information about the table. Specifically, I want to know which data types are used and whether there are any gaps (nulls) in the data. First, we'll get a snapshot of what the dataset looks like and see if there are any patterns to the missing data.

The sample contains 5000 entries and 21 columns, 8 of which are numerical (float64 and int64). Some columns have fewer than 5000 non-null entries, which indicates null values. Let's observe how many null values there are.
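As a minimal sketch of this check, the helper below tabulates the data type, null count, and null percentage of each column. The function name `null_summary` and the toy rows are my own illustration, not the notebook's original code or real data.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype, null count, and null percentage of each column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isna().sum(),
        "pct_null": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most gaps come first
    return summary.sort_values("n_null", ascending=False)

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "diameter": [4.0, 12.5, 8.0, 20.0],
    "date_planted": ["2001-04-12", None, None, "1999-10-30"],
    "cultivar_name": ["KWANZAN", None, "KWANZAN", None],
})
print(null_summary(toy))
```

On the real sample, the same call would be made on the imported DataFrame instead of `toy`.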

To answer my questions, I will need the following columns. Since date_planted is missing half of the dates, it won't be useful for analysis. However, I will keep it for now to see if there are any patterns to the missing data.

Exploratory Visualizations

Visualizing missing values

Is there a pattern to the missing data? There are 5000 entries in this sample dataset. The columns "date_planted" and "cultivar_name" have a significant amount of missing data, with only about 50% of values recorded. I will now create a visualization of the missing values to help us identify possible trends.

There are too many missing date_planted and cultivar_name values in this dataset for these columns to be useful for analysis. However, we can still check whether there are any patterns in the missing data.

The frequency plot doesn't show any trend linking these gaps to the other missing data. However, it's interesting that the scatterplot shows few trees recorded as planted in June, July, August, and September. This "anomaly" may make sense knowing trees are typically planted in the fall or spring.

Prunus and Acer are the two most common genera among trees with missing dates. We will see later on whether this is an anomaly when compared to all the Vancouver trees.
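One way to check the seasonal dip numerically is to count the recorded planting dates by calendar month. This is a rough sketch: `plantings_by_month` is a hypothetical helper, and the toy dates are made up for illustration.

```python
import pandas as pd

def plantings_by_month(df: pd.DataFrame, date_col: str = "date_planted") -> pd.Series:
    """Count trees by planting month (1-12), skipping rows with no date."""
    dates = pd.to_datetime(df[date_col], errors="coerce").dropna()
    counts = dates.dt.month.value_counts()
    # Reindex so that months with no plantings appear as 0
    return counts.reindex(range(1, 13), fill_value=0).astype(int)

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"date_planted": ["2001-04-12", None, "1999-10-30", "2005-04-02"]})
print(plantings_by_month(toy))
```

Rows with a missing date are dropped before counting, so the result only describes trees whose planting date was recorded.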

Question 1: Which Vancouver neighbourhood has the most trees?

We would want to overlay this chart with the "Number of Trees by Neighborhood with Unknown Planting Date" chart to look for anomalies in the missing data.
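The comparison between total trees and trees with an unknown planting date can be sketched as a single table. The column name `neighbourhood_name` is an assumption about the schema, and `trees_by_neighbourhood` is a hypothetical helper.

```python
import pandas as pd

def trees_by_neighbourhood(
    df: pd.DataFrame,
    hood_col: str = "neighbourhood_name",  # assumed column name
    date_col: str = "date_planted",
) -> pd.DataFrame:
    """Count all trees and trees with a missing planting date per neighbourhood."""
    grouped = df.assign(missing_date=df[date_col].isna()).groupby(hood_col)
    out = pd.DataFrame({
        "n_trees": grouped.size(),
        "n_missing_date": grouped["missing_date"].sum(),
    })
    return out.sort_values("n_trees", ascending=False)

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "neighbourhood_name": ["KITSILANO", "KITSILANO", "SUNSET", "KITSILANO"],
    "date_planted": ["2001-04-12", None, None, "1999-10-30"],
})
print(trees_by_neighbourhood(toy))
```

Having both counts side by side makes it easier to spot a neighbourhood whose share of missing dates is unusually high.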

Question 2: What are Vancouver's top 10 most common tree species?

The most common tree species in Vancouver is Serrulata, also known as Japanese cherry. The most common tree by common name is the Kwanzan Flowering Cherry.
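A ranking like this can be produced with a frequency count. The sketch below assumes the species and common names live in columns called `species_name` and `common_name`; `top_n` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd

def top_n(df: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Return the n most frequent values of a column, most common first."""
    return df[column].value_counts().head(n)

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "species_name": ["SERRULATA", "PLATANOIDES", "SERRULATA", "CERASIFERA"],
    "common_name": [
        "KWANZAN FLOWERING CHERRY", "NORWAY MAPLE",
        "KWANZAN FLOWERING CHERRY", "PISSARD PLUM",
    ],
})
print(top_n(toy, "species_name"))
print(top_n(toy, "common_name"))
```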

Question 3: Which neighbourhood has the tallest trees?

This shows the tallest tree height for each neighbourhood. West Point Grey, Grandview-Woodland, Kensington-Cedar Cottage, Shaughnessy, Kitsilano, and Renfrew-Collingwood have the tallest trees in Vancouver. Since cherry blossom trees are the most common trees in Vancouver, we will focus the rest of our analysis on just the Japanese cherry blossom trees.
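The per-neighbourhood maximum can be sketched as a grouped aggregation. I am assuming the height code is stored in a column named `height_range_id` and the neighbourhood in `neighbourhood_name`; the helper name and toy rows are illustrative only.

```python
import pandas as pd

def tallest_by_neighbourhood(
    df: pd.DataFrame,
    hood_col: str = "neighbourhood_name",  # assumed column name
    height_col: str = "height_range_id",   # assumed column name
) -> pd.Series:
    """Return the largest height range code found in each neighbourhood."""
    return df.groupby(hood_col)[height_col].max().sort_values(ascending=False)

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "neighbourhood_name": ["KITSILANO", "KITSILANO", "SUNSET", "SHAUGHNESSY"],
    "height_range_id": [2, 9, 4, 9],
})
tallest = tallest_by_neighbourhood(toy)
# Several neighbourhoods can tie for the top height range
print(tallest[tallest == tallest.max()])
```

Because the height is a coarse range code rather than a measurement, ties at the top are expected, which is why several neighbourhoods can share first place.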

Question 4: Which neighbourhood has the most cherry blossom trees?

The Mount Pleasant neighbourhood has the most Japanese Cherry trees.
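A count like this depends on how "cherry" is defined. The sketch below assumes a tree is a cherry if its `common_name` contains the word "cherry"; that rule, the column names, and the helper `cherry_counts` are my assumptions rather than the notebook's original filter.

```python
import pandas as pd

def cherry_counts(
    df: pd.DataFrame,
    name_col: str = "common_name",         # assumed column name
    hood_col: str = "neighbourhood_name",  # assumed column name
) -> pd.Series:
    """Count trees whose common name mentions 'cherry', per neighbourhood."""
    is_cherry = df[name_col].str.contains("cherry", case=False, na=False)
    return df.loc[is_cherry, hood_col].value_counts()

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "neighbourhood_name": [
        "MOUNT PLEASANT", "MOUNT PLEASANT", "SUNSET", "MOUNT PLEASANT", "SUNSET",
    ],
    "common_name": [
        "KWANZAN FLOWERING CHERRY", "NORWAY MAPLE", "KWANZAN FLOWERING CHERRY",
        "AKEBONO FLOWERING CHERRY", None,
    ],
})
print(cherry_counts(toy))
```

Filtering on the genus or species column instead would give a different count, so the choice of rule should be stated alongside the chart.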

Question 5: Is there a relationship between tree height and diameter?

For all trees:

There is a clear positive trend between the diameter of the tree and its height.

For cherry trees:

Cherry trees also show a clear positive trend between the diameter of the tree and its height, as expected.
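To put a number on the trend, a rank correlation is one option, since the height is an ordinal range code rather than a continuous measurement. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. The column names and the helper `rank_correlation` are assumptions, and this is not necessarily how the plots above were produced.

```python
import pandas as pd

def rank_correlation(df: pd.DataFrame, x_col: str, y_col: str) -> float:
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    pair = df[[x_col, y_col]].dropna()
    # rank() assigns average ranks to ties, as Spearman's method requires
    return pair[x_col].rank().corr(pair[y_col].rank())

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "height_range_id": [1, 2, 3, 4],
    "diameter": [4.0, 10.0, 9.0, 20.0],
})
rho = rank_correlation(toy, "height_range_id", "diameter")
print(round(rho, 2))
```

A value near 1 would support the positive trend seen in the charts, and the same function can be applied to the cherry-only subset.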

Here we can see that most of the shorter cherry trees (height range = 1) have a diameter of 4. We cannot draw a conclusion for cherry trees with height range 4, because there is not enough data: the diameters are cut at 20 and 33.

Question 6: What is the geographic distribution of trees in density?

These heat maps show us where the trees are bunched up in Vancouver. They show how the trees are geographically distributed by mapping tree density.
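The idea behind a density map can be sketched by counting trees in each cell of a regular latitude/longitude grid. This uses `numpy.histogram2d`; the helper name, the bin count, and the toy coordinates are my own choices for illustration.

```python
import numpy as np

def tree_density_grid(latitudes, longitudes, bins=20):
    """Count trees in each cell of a regular latitude/longitude grid."""
    counts, lat_edges, lon_edges = np.histogram2d(latitudes, longitudes, bins=bins)
    return counts, lat_edges, lon_edges

# Toy coordinates roughly in the Vancouver area, for illustration only.
lat = [49.21, 49.21, 49.25, 49.29]
lon = [-123.10, -123.10, -123.15, -123.05]
counts, lat_edges, lon_edges = tree_density_grid(lat, lon, bins=4)
print(counts)
```

Each entry of `counts` is the number of trees in one grid cell, so larger values correspond to the darker areas of a heat map. A finer grid shows more local detail but leaves more cells empty.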

This scatter plot shows a "map" of Vancouver. It can better show us the patches of area without trees, and also shows how the trees are geographically distributed in height.

This scatter plot shows how the trees are geographically distributed in diameter.

Question 7: What is the range and distribution of all numerical columns?

The dataset has too many NaN values for date_planted, so this column is too unreliable to calculate the distribution of tree age in Vancouver. However, we can see a clear range for the trees' diameter and height range.

According to the dataset schema, the height range is a code from 0 to 10 in which each step covers 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft).

Tree diameter is measured in inches as DBH (diameter at breast height).

This shows the distribution of trees' diameter and height, as well as the location (latitude and longitude) distribution of the trees.
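For readable axis labels, the height range code can be translated back into feet. The helper below follows the mapping described above; the function name `height_range_label` is hypothetical.

```python
def height_range_label(code: int) -> str:
    """Translate a height range code (0-10) into a label in feet."""
    if not 0 <= code <= 10:
        raise ValueError(f"height range code must be between 0 and 10, got {code}")
    if code == 10:
        # The top code is open-ended
        return "100+ ft"
    return f"{code * 10}-{(code + 1) * 10} ft"

for code in (0, 1, 2, 10):
    print(code, "=", height_range_label(code))
```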

Question 8: What is the range and distribution of all categorical columns?

The most popular species is Cerasifera. The neighbourhood with the most trees is Renfrew-Collingwood. The most popular genus is Prunus. The most popular tree is the Kwanzan Flowering Cherry. Most trees do not have a root barrier.

Question 9: What is the relationship between numerical and categorical columns?

Exploring the relationship between the numerical and categorical data gives us interesting results. The genus Acer shows significant variability in both diameter and height, suggesting that this genus includes both small and large species. Norway Maple and Kwanzan Flowering Cherry have a broader range in diameter and height, which could indicate these species are popular and planted in various conditions.
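The variability described here can also be summarized in a table, which is a useful cross-check on what the plots suggest. This is a sketch under the assumption that the genus is stored in a column named `genus_name`; `spread_by_group` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd

def spread_by_group(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Summarize the spread of a numerical column within each category."""
    stats = df.groupby(group_col)[value_col].agg(["count", "min", "median", "max", "std"])
    # Most variable groups first; groups with a single tree (std = NaN) go last
    return stats.sort_values("std", ascending=False)

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "genus_name": ["ACER", "ACER", "ACER", "PRUNUS", "PRUNUS"],
    "diameter": [3.0, 30.0, 12.0, 8.0, 10.0],
})
print(spread_by_group(toy, "genus_name", "diameter"))
```

Reporting the tree count next to the spread helps separate genera that are truly variable from those that only have a handful of records.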

Concluding Remarks

I will be including 4 graphs in my Vancouver Tree analysis project. The graphs will be redone with suitable design choices so they are easy for the audience to read. I also want to use the full Vancouver street tree dataset instead of the 5000-entry sample, so that the results reflect the real data.

The audience will not need to see graphs with too many analytical details such as means, medians, and boxplots, because the report is meant for the general public to see and understand easily.